
[SPARK-56451][DOCS] Document how SDP datasets are stored and refreshed#55277

Open
moomindani wants to merge 1 commit into apache:master from moomindani:sdp-doc-storage-refresh

Conversation


@moomindani (Contributor) commented on Apr 9, 2026:

What changes were proposed in this pull request?

Add a new "How Datasets are Stored and Refreshed" section to the Spark Declarative Pipelines programming guide. This section covers:

  • Table Format: Default format (parquet via spark.sql.sources.default) and how to specify a different format with Python and SQL examples
  • How Materialized Views are Refreshed: Full recomputation (TRUNCATE + append) on every pipeline run, and how this differs from database-native materialized views
  • How Streaming Tables are Refreshed: Incremental processing with checkpoints and schema evolution support
  • Full Refresh: Behavior differences between materialized views and streaming tables

Why are the changes needed?

The current programming guide explains how to define datasets but does not explain how they are stored or refreshed. Users need to understand:

  • What format their tables are stored in by default
  • That materialized views perform a full recomputation on every run (unlike PostgreSQL-style MVs)
  • That streaming tables require checkpoint storage on a Hadoop-compatible file system
  • What --full-refresh actually does for each dataset type

Without this information, users cannot make informed decisions about table formats, storage configurations, or pipeline performance.

Does this PR introduce any user-facing change?

No. Documentation only.

How was this patch tested?

Documentation change only. Verified the content is accurate by reading the SDP implementation (DatasetManager.scala, FlowExecution.scala).

Was this patch authored or co-authored using generative AI tooling?

Yes.

@moomindani force-pushed the sdp-doc-storage-refresh branch from 5763f95 to 723afa3 on April 9, 2026 at 08:08.
@dongjoon-hyun (Member) left a comment:

Hi, @moomindani .

The Apache Spark community uses JIRA IDs for bug tracking. Your PR title references the wrong JIRA ID.

(screenshot of the SPARK-55276 JIRA issue, 2026-04-10)

SPARK-55276 is "Upgrade scala-maven-plugin to 4.9.9".

@moomindani changed the title from "[SPARK-55276][DOCS] Document how SDP datasets are stored and refreshed" to "[SPARK-56451][DOCS] Document how SDP datasets are stored and refreshed" on Apr 11, 2026.
Add a new section to the Spark Declarative Pipelines programming guide
that explains the storage and refresh mechanics, including:
- Default table format and how to specify a different format
- How materialized views are refreshed (full recomputation via TRUNCATE + append)
- How streaming tables are refreshed (incremental processing with checkpoints)
- Full refresh behavior for both dataset types
@moomindani force-pushed the sdp-doc-storage-refresh branch from 723afa3 to 4c09d8f on April 11, 2026 at 00:49.
@moomindani (Contributor, Author) replied:

Thank you for pointing that out, @dongjoon-hyun. I've updated the PR title and commit message to use the correct JIRA ID: SPARK-56451. The GitHub issue has been closed.

@jaceklaskowski (Contributor) left a comment:
LGTM (with some tiny changes)


> SDP itself does not restrict which table formats can be used. However, the table format must be supported by the configured catalog. For example, a Delta catalog only supports Delta tables, while the default session catalog supports Parquet, ORC, and other built-in formats.
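The format rules described in the new docs (an explicit format wins, otherwise the `spark.sql.sources.default` default of parquet applies, and the catalog must support the result) can be sketched in plain Python. This is a simplified illustration of the resolution order, not Spark's actual code:

```python
# Simplified sketch of table-format resolution, as described above.
# "spark.sql.sources.default" is a real Spark SQL config key whose
# built-in default is "parquet"; everything else here is illustrative.
def resolve_table_format(declared, session_conf, catalog_supported):
    fmt = declared or session_conf.get("spark.sql.sources.default", "parquet")
    if fmt not in catalog_supported:
        raise ValueError(f"catalog does not support format: {fmt}")
    return fmt

# A hypothetical session catalog supporting the built-in formats.
session_catalog = {"parquet", "orc", "json", "csv"}
default_fmt = resolve_table_format(None, {}, session_catalog)   # "parquet"
orc_fmt = resolve_table_format("orc", {}, session_catalog)      # "orc"
```

A declared format that the catalog does not support (e.g., asking a parquet-only catalog for Delta) fails the check, which matches the excerpt's point that the catalog, not SDP, constrains the choice.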
A contributor commented:
What's a "catalog" here? Table formats are set up via packages on the command line when the Spark Connect server is started.


> This means that every refresh is a **full recomputation** - there is no incremental or differential update. For tables with large amounts of data, be aware that each pipeline run will reprocess the entire dataset.
>
> Because of this mechanism, the materialized view's underlying table format must support the `TRUNCATE TABLE` operation.
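The TRUNCATE + append semantics described in the excerpt can be pictured with a small simulation. Plain Python lists stand in for the backing table; this illustrates the refresh semantics only and is not SDP's implementation:

```python
# Toy model of a materialized view refresh: every pipeline run
# truncates the backing table and re-appends the full query result,
# mirroring the TRUNCATE + append behavior described above.
def refresh_materialized_view(table, compute_full_result):
    table.clear()                         # TRUNCATE TABLE
    table.extend(compute_full_result())   # append the recomputed rows
    return table

source = [1, 2, 3]
mv = refresh_materialized_view([], lambda: [x * 10 for x in source])
# mv is now [10, 20, 30]

source.append(4)  # the source grows; the next run recomputes everything
refresh_materialized_view(mv, lambda: [x * 10 for x in source])
# mv is now [10, 20, 30, 40] -- the whole dataset was reprocessed
```

Note that the second run rebuilds all four rows rather than appending only the new one, which is exactly why the excerpt warns about large datasets.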
A contributor suggested:
Suggested change:
- Because of this mechanism, the materialized view's underlying table format must support the `TRUNCATE TABLE` operation.
+ Because of this mechanism, the materialized view's underlying table format must support the `TRUNCATE TABLE` operation (e.g., Delta Lake).

> 2. New data is appended to the existing table data.
> 3. A checkpoint tracks the processing progress so subsequent runs resume from where the last run left off.
>
> Streaming tables require a checkpoint directory on a Hadoop-compatible file system (e.g., HDFS, Amazon S3, Azure ADLS Gen2, Google Cloud Storage, or local file system). The checkpoint directory is configured via the `storage` field in the pipeline spec file.
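The incremental behavior in the steps above can likewise be simulated: only input beyond the last checkpointed position is processed, and the checkpoint then advances. A sketch of the described semantics, not SDP's code (here the checkpoint is a dict with a hypothetical `offset` key rather than a directory on a file system):

```python
# Toy model of a streaming-table refresh: process only rows past the
# last checkpointed offset, append them, then persist the new offset
# so the next run resumes from where this one left off.
def refresh_streaming_table(table, checkpoint, source):
    offset = checkpoint.get("offset", 0)
    table.extend(source[offset:])       # process only unseen input
    checkpoint["offset"] = len(source)  # record progress
    return table

source, st, ckpt = [1, 2, 3], [], {}
refresh_streaming_table(st, ckpt, source)   # st == [1, 2, 3]
source += [4, 5]
refresh_streaming_table(st, ckpt, source)   # st == [1, 2, 3, 4, 5]
```

Unlike the materialized-view case, the second run touches only the two new rows, which is the incremental-processing property the excerpt describes.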
A contributor suggested:
Suggested change:
- Streaming tables require a checkpoint directory on a Hadoop-compatible file system (e.g., HDFS, Amazon S3, Azure ADLS Gen2, Google Cloud Storage, or local file system). The checkpoint directory is configured via the `storage` field in the pipeline spec file.
+ Streaming tables require a checkpoint directory on a Hadoop-compatible file system (e.g., local file system, HDFS, Amazon S3, Azure ADLS Gen2, Google Cloud Storage). The checkpoint directory is configured via the `storage` field in the pipeline spec file.


> ### Full Refresh
>
> You can force a full refresh of specific datasets or the entire pipeline using the `--full-refresh` or `--full-refresh-all` CLI options. A full refresh:
A contributor suggested:
Suggested change:
- You can force a full refresh of specific datasets or the entire pipeline using the `--full-refresh` or `--full-refresh-all` CLI options. A full refresh:
+ You can force a full refresh of specific datasets or the entire pipeline using the `--full-refresh` or `--full-refresh-all` CLI options, respectively. A full refresh:
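For a streaming table, a full refresh can be pictured as discarding accumulated state and reprocessing from the beginning. The sketch below assumes (based on the commit message's "full recomputation" framing; the exact behavior is documented in the PR's new section) that a full refresh resets the checkpoint and truncates the table. Illustration only, not SDP's implementation:

```python
# Toy model of --full-refresh on a streaming table, assuming it
# discards the checkpoint and truncates the table so all source
# data is reprocessed from scratch on the next run.
def full_refresh_streaming_table(table, checkpoint, source):
    checkpoint.clear()                  # drop checkpointed progress
    table.clear()                       # TRUNCATE the backing table
    table.extend(source)                # reprocess the entire input
    checkpoint["offset"] = len(source)  # record fresh progress
    return table

st, ckpt = [99], {"offset": 1}          # stale state from earlier runs
full_refresh_streaming_table(st, ckpt, [1, 2, 3])
# st == [1, 2, 3], ckpt == {"offset": 3}
```

For a materialized view, by contrast, a full refresh is no different from a normal refresh, since every run already recomputes everything.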

@jaceklaskowski (Contributor):

Please add [SDP] tag to the title 🙏
